Add automatic reference processing and markuplib support for DOCX #60

Open
eduranm wants to merge 8 commits into scieloorg:main from eduranm:issue-03

@eduranm (Contributor) commented Apr 20, 2026

What does this PR do?

Adds the groundwork for automatically processing bibliographic references within markup_doc and introduces markuplib for structural reading of DOCX files.

It includes:

  • registration of markuplib;
  • addition of markuplib/ with utilities for analyzing DOCX files;
  • new utilities in markup_doc for processing and marking up references;
  • processing of the uploaded document;
  • automatic triggering of the processing from the creation flow.

Where could the review start?

By commits

How could this be tested manually?

Bring up the environment;
Upload a DOCX through the markup_doc flow;
Verify that the document moves to the PROCESSING status;
Once it finishes, check that the references are added in structured form to the processed document.

Any context you would like to provide?

It focuses on automatic reference processing and on structural reading of the DOCX, laying the groundwork for continuing with the front matter, text, and XML output.

Screenshots

N/A

What are the relevant tickets?

#59

References

  • N/A

Copilot AI left a comment

Pull request overview

This PR adds the groundwork for structural DOCX analysis (via markuplib) and integrates automatic reference processing into the markup_doc flow, triggering it when a document is created/uploaded.

Changes:

  • Registers the markup_doc and markuplib apps and adds base utilities for structural DOCX analysis.
  • Adds Celery tasks to process the uploaded DOCX, detect references, and persist the processed content in the document.
  • Adds Wagtail hooks/admin for the upload flow and for syncing collections/journals from the API.

Reviewed changes

Copilot reviewed 16 out of 23 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| model_ai/llama.py | Adjusts the Gemini flow in LlamaService (includes a fixed pause after each response). |
| markuplib/function_docx.py | New utilities for opening DOCX files and extracting content/structure from them. |
| markuplib/__init__.py | Initializes the markuplib package. |
| markup_doc/wagtail_hooks.py | Wagtail ViewSets and hooks for upload/editing and for triggering automatic processing. |
| markup_doc/tests.py | Test file (placeholder). |
| markup_doc/tasks.py | Celery task that processes the DOCX and structures content + references. |
| markup_doc/sync_api.py | Syncs collections and journals from the SciELO Core API. |
| markup_doc/models.py | Models and StreamFields for persisting front/body/back and metadata. |
| markup_doc/migrations/__init__.py | Initializes the migrations module. |
| markup_doc/migrations/0001_initial.py | Initial migration for the markup_doc models. |
| markup_doc/migrations/0002_alter_articledocx_estatus_and_more.py | Adjusts fields/choices for estatus. |
| markup_doc/marker.py | Utilities for LLM-based markup (article/references). |
| markup_doc/labeling_utils.py | Utilities for segmentation, APA citation extraction, and mapping/labeling. |
| markup_doc/forms.py | Form base (placeholder). |
| markup_doc/choices.py | Choices/base structure for labels and ordering rules. |
| markup_doc/apps.py | AppConfig for markup_doc. |
| markup_doc/admin.py | Django admin (placeholder). |
| markup_doc/__init__.py | Initializes the markup_doc package. |
| fixtures/e14790.docx | Sample DOCX for manual testing. |
| config/settings/base.py | Registers markup_doc and markuplib in INSTALLED_APPS. |

Comment on lines +48 to +65
if model.name_file:
    user = User.objects.get(pk=user_id)
    refresh = RefreshToken.for_user(user)
    access_token = refresh.access_token

    #url = "http://172.17.0.1:8400/api/v1/mix_citation/reference/"
    #url = "http://172.17.0.1:8009/api/v1/mix_citation/reference/"

    # FIXME: Hardcoded URL
    url = "http://django:8000/api/v1/reference/"

headers = {
    'Authorization': f'Bearer {access_token}',
    'Content-Type': 'application/json'
}

response = requests.post(url, json=payload, headers=headers)

Copilot AI Apr 24, 2026

In process_reference(), access_token and url are only set inside if model.name_file:, but headers and requests.post() run unconditionally. If name_file is blank (e.g., using a remote API), this will raise UnboundLocalError. Initialize url/access_token for both branches or return/raise when the required config is missing.
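
One way to address this, sketched under assumptions: resolve the URL and headers in a single helper that fails fast when the required configuration is missing, so the POST never runs with unbound names. The helper name `build_reference_request` and its parameters are hypothetical and not part of the PR; the URL is passed in rather than hardcoded.

```python
from typing import Optional, Tuple


def build_reference_request(name_file: Optional[str],
                            access_token: Optional[str],
                            url: Optional[str]) -> Tuple[str, dict]:
    """Return (url, headers) for the reference API or raise if config is missing.

    Hypothetical helper: the names and signature are illustrative only.
    """
    if not name_file:
        # Fail fast instead of reaching the POST with unbound variables.
        raise ValueError("name_file is empty; no reference endpoint is configured")
    if not url or not access_token:
        raise ValueError("both url and access_token are required")
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    return url, headers
```

The caller would then pass the returned pair to `requests.post(url, json=payload, headers=headers)` and handle the `ValueError` where the task decides how to report a misconfiguration.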

Copilot uses AI. Check for mistakes.
Comment on lines +660 to +673
def match_section(item, sections):
    return {'label': '<sec>', 'body': True} if (
        item.get('font_size') == sections[0].get('size') and
        item.get('bold') == sections[0].get('bold') and
        item.get('text', '').isupper() == sections[0].get('isupper')
    ) else None


def match_subsection(item, sections):
    return {'label': '<sub-sec>', 'body': True} if (
        item.get('font_size') == sections[1].get('size') and
        item.get('bold') == sections[1].get('bold') and
        item.get('text', '').isupper() == sections[1].get('isupper')
    ) else None

Copilot AI Apr 24, 2026

match_section()/match_subsection() index sections[0] and sections[1] without checking length. If sections has fewer than 2 entries (common for short/simple documents), this will raise IndexError. Add guards (e.g., if len(sections) > 0/1) before indexing.
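
A minimal sketch of the guard, assuming the same `item` and `sections` shapes as the quoted code. The shared helper `match_heading` is hypothetical; it only factors out the length check so both functions return None when the requested heading level is not defined.

```python
def match_heading(item, sections, level, label):
    """Return a label dict when `item` matches the heading style at `level`, else None."""
    if level >= len(sections):
        # Short documents may define fewer heading styles than expected.
        return None
    style = sections[level]
    matches = (
        item.get('font_size') == style.get('size')
        and item.get('bold') == style.get('bold')
        and item.get('text', '').isupper() == style.get('isupper')
    )
    return {'label': label, 'body': True} if matches else None


def match_section(item, sections):
    return match_heading(item, sections, 0, '<sec>')


def match_subsection(item, sections):
    return match_heading(item, sections, 1, '<sub-sec>')
```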

Comment on lines +700 to +752
if not result:
result = {'label': '<p>', 'body': state['body'], 'back': state['back']}
state['label'] = result.get('label')
state['body'] = result.get('body')
state['back'] = result.get('back')

if result:
pass
else:
if state.get('label_next'):
if state.get('repeat'):
result = match_by_regex(item.get('text'), order_labels)
if result:
state['label'] = result[0]
else:
result = match_by_style_and_size(item, order_labels, style='bold')
if result:
state['label'] = result[0]
state['repeat'] = None
state['reset'] = None
state['label_next'] = result[1].get("next")
state['body'] = result[1].get("size") == 16
if state['body'] and re.search(r"^(refer)", item.get('text').lower()):
state['body'] = False
state['back'] = True
if not result:
result = match_next_label(item, state['label_next'], order_labels)
if result:
state['label'] = result[0]
state['label_next_reset'] = result[1].get("next")
state['reset'] = result[1].get("reset", False)
state['repeat'] = result[1].get("repeat", False)
else:
result = match_by_style_and_size(item, order_labels, style='bold')
if result:
state['label'] = result[0]
state['label_next'] = result[1].get("next")
if state.get('body') and re.search(r"^(refer)", item.get('text').lower()):
state['body'] = False
state['back'] = True
else:
result = match_by_style_and_size(item, order_labels, style='italic')
if result:
state['label'] = re.sub(r"-\d+", "", result[0])
state['label_next'] = result[1].get("next")
else:
result = match_by_regex(item.get('text'), order_labels)
if result:
state['label'] = result[0]
else:
result = match_paragraph(item, order_labels)
if result:
state['label'] = result[0]

Copilot AI Apr 24, 2026

In create_labeled_object2(), result is forced to a non-empty dict at line 700 and then the else: branch (which contains most of the labeling logic) becomes unreachable because of if result: pass. This makes the function effectively label everything as <p> unless it matches the section/subsection checks. Rework the control flow so the detailed matching logic can run when appropriate.
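
One possible shape for the reworked control flow, shown as a simplified sketch rather than a drop-in replacement: try the matchers in order and apply the `<p>` default only after all of them have failed. The function `label_item` and the idea of passing matchers as a list are assumptions for illustration; the real function also tracks `label_next`, `repeat`, and `reset`, which are omitted here.

```python
def label_item(item, matchers, state):
    """Try each matcher in order; fall back to '<p>' only if none matches."""
    for matcher in matchers:
        result = matcher(item)
        if result:
            break
    else:
        # The default is applied last, so it cannot shadow the matchers.
        result = {'label': '<p>', 'body': state.get('body'), 'back': state.get('back')}

    # Keep the previous body/back flags when a matcher does not set them.
    state.update(
        label=result.get('label'),
        body=result.get('body', state.get('body')),
        back=result.get('back', state.get('back')),
    )
    return result
```

With this ordering, the default dict is never built before the detailed matching runs, which is the condition that made the `else:` branch unreachable in the quoted code.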

obj['type'] = 'aff_paragraph'

if re.search(r"^(translation)", item.get('text').lower()):
    state['label'] = '<translate-fron>'

Copilot AI Apr 24, 2026

state['label'] = '<translate-fron>' looks like a typo (missing 't') and will produce a label that doesn't match the choices (<translate-front>). Use the correct label string so downstream logic can recognize it.

Suggested change
- state['label'] = '<translate-fron>'
+ state['label'] = '<translate-front>'

Comment thread model_ai/llama.py
Comment on lines +101 to +103
response_gemini = model.generate_content(user_input).text
time.sleep(15)
return response_gemini

Copilot AI Apr 24, 2026

time.sleep(15) after every Gemini call will throttle all reference processing and can tie up Celery workers even when the request succeeds. Consider removing the unconditional sleep and instead implement retry/backoff only when Gemini returns rate-limit/transient errors (e.g., 429/503), ideally with jitter.
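
A minimal sketch of retry with exponential backoff and full jitter, under stated assumptions: `TransientError` is a hypothetical stand-in for whatever exception the Gemini client raises on rate-limit or temporary failures (the PR does not show which), and `call_with_backoff` is an illustrative helper, not existing code.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a rate-limit (429) or unavailable (503) error."""


def call_with_backoff(func, *, retries=4, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call `func`; on TransientError, wait with exponential backoff plus jitter."""
    for attempt in range(retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == retries:
                # Out of attempts: let the caller see the failure.
                raise
            # Full jitter: a random fraction of the capped exponential delay.
            delay = min(max_delay, base_delay * (2 ** attempt)) * rng()
            sleep(delay)
```

A successful call returns immediately with no pause, so workers are only held up when the service actually signals a transient problem. The `sleep` and `rng` parameters exist so the behavior can be tested without real waiting.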

Comment thread markup_doc/models.py
Comment on lines +81 to +89
def update(cls, title, estatus):
    try:
        obj = cls.get(title=title)
    except (cls.DoesNotExist, ValueError):
        pass

    obj.estatus = estatus
    obj.save()
    return obj

Copilot AI Apr 24, 2026

In update(), if get() raises DoesNotExist, the exception is swallowed and obj is left undefined, but the code still tries to set obj.estatus. Either re-raise/return early when not found, or create the object as appropriate.
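
The early-return option can be sketched as follows. `InMemoryDoc` is a hypothetical in-memory stand-in for the Django model so the example runs on its own; only the body of `update()` mirrors the quoted code, and returning None when the row is missing is one of several reasonable choices (re-raising is another).

```python
class InMemoryDoc:
    """Hypothetical stand-in for the model, used only for illustration."""

    class DoesNotExist(Exception):
        pass

    _rows = {}

    def __init__(self, title, estatus):
        self.title = title
        self.estatus = estatus

    def save(self):
        type(self)._rows[self.title] = self

    @classmethod
    def get(cls, title):
        try:
            return cls._rows[title]
        except KeyError:
            raise cls.DoesNotExist(title) from None

    @classmethod
    def update(cls, title, estatus):
        try:
            obj = cls.get(title=title)
        except (cls.DoesNotExist, ValueError):
            # Return early: there is no row to update.
            return None
        obj.estatus = estatus
        obj.save()
        return obj
```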

Comment on lines +238 to +255
if is_numPr:
    numPr = paragraph.find('.//w:numPr', namespaces=paragraph.nsmap)
    numId = numPr.find('.//w:numId', namespaces=paragraph.nsmap).get(namespaces_p + 'val')
    type = [(key, objt) for key, objt in list_types.items() if objt['numId'] == numId]

    # It is a different list
    if numId != current_num_id:
        current_num_id = numId
        if len(current_list) > 0:
            current_list.append('[/list]')
            objl = {}
            objl['type'] = 'list'
            objl['list'] = '\n'.join(current_list)
            current_list = []
            content.append(objl)
        list_type = 'bullet'
        if type[0][1][str(0)] == 'decimal':
            list_type = 'order'

Copilot AI Apr 24, 2026

extract_numbering_info() can return None when word/numbering.xml is missing, but extractContent() unconditionally does list_types.items() and later indexes type[0]. This will raise at runtime for DOCX files without numbering or with unexpected numId mappings; handle list_types is None and the empty-match case before using it.
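
A sketch of how both cases could be handled in one place. The helper `resolve_list_type` is hypothetical, and the shape of `list_types` (a dict whose values carry a `numId` plus per-level formats keyed by `'0'`, `'1'`, ...) is inferred from the quoted code rather than confirmed. Falling back to `'bullet'` is an assumption; the PR may prefer a different default.

```python
def resolve_list_type(list_types, num_id, level=0):
    """Return 'order' for decimal numbering at `level`, otherwise 'bullet'."""
    if not list_types:
        # No word/numbering.xml was found: fall back to a bullet list (assumed default).
        return 'bullet'
    matches = [spec for spec in list_types.values() if spec.get('numId') == num_id]
    if not matches:
        # The paragraph references a numId that numbering.xml does not define.
        return 'bullet'
    return 'order' if matches[0].get(str(level)) == 'decimal' else 'bullet'
```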

else:
    obj['spacing'] = False

clean_text = clean_labels(child.text)

Copilot AI Apr 24, 2026

clean_labels(child.text) will fail when child.text is None (common for <w:r> elements which usually contain <w:t> children). This will raise a TypeError in re.sub. Extract text from the run's <w:t> nodes (or guard against None) before calling clean_labels.

Suggested change
- clean_text = clean_labels(child.text)
+ run_text_nodes = child.xpath('.//w:t/text()', namespaces=child.nsmap)
+ raw_text = ''.join(run_text_nodes) if run_text_nodes else (child.text or '')
+ clean_text = clean_labels(raw_text)

Comment thread markup_doc/tasks.py
Comment on lines +31 to +32
def get_labels(title, user_id):
    article_docx = UploadDocx.objects.get(title=title)

Copilot AI Apr 24, 2026

UploadDocx.objects.get(title=title) relies on non-unique titles and can raise MultipleObjectsReturned / select the wrong row. Prefer passing a primary key to the task and fetching by pk.

Suggested change
- def get_labels(title, user_id):
-     article_docx = UploadDocx.objects.get(title=title)
+ def get_labels(upload_docx_id, user_id):
+     article_docx = UploadDocx.objects.get(pk=upload_docx_id)

# FIXME: This function always fetches the first LlamaModel instance.
model_ai = LlamaModel.objects.first()

if model_ai.api_key_gemini:

Copilot AI Apr 24, 2026

get_llm_model_name() assumes a LlamaModel row always exists; if the table is empty, model_ai will be None and model_ai.api_key_gemini will raise. Guard with if model_ai and model_ai.api_key_gemini: (and decide on a sensible default when it is None).

Suggested change
- if model_ai.api_key_gemini:
+ if model_ai and model_ai.api_key_gemini:
